feat(bridge): crash-safe idempotency for withdraws and refunds #1096
Merged
Adds a persistent idempotency store and startup reconciliation so a crash between submitting a Stellar payment and confirming it on TFChain is recovered without double-paying or double-confirming. Per-transaction submit is retained (batching is intentionally excluded/deferred).

- pkg/idempotency.go: bbolt-backed store tracking PROCESSING/COMPLETED state per withdraw (by burn tx id) and refund (by tx hash); refuses to downgrade COMPLETED back to PROCESSING. Reset() wipes the store.
- handleWithdrawReady / handleRefundReady: check the store first (skip if COMPLETED); on PROCESSING, look up Horizon for an existing outgoing tx and only complete the TFChain confirmation if found, otherwise mark PROCESSING, submit, confirm, then mark COMPLETED. The existing #1092 undeliverable-refund quarantine is preserved and now also marks the refund COMPLETED.
- Withdraw recovery: withdraw payments are now tagged with the burn tx id as a Stellar text memo (traceability + recovery by memo), with the account sequence number as a fallback for pre-memo submissions. The memo is part of the signed tx and is set identically at both build sites (CreatePaymentAndReturnSignature + CreatePaymentWithSignaturesAndSubmit). Refund recovery uses the existing MemoReturn hash, sequence as fallback.
- reconcilePendingTransactions: runs once at startup to recover entries left PROCESSING by a previous run, using one Horizon page for all lookups.
- Event loop: process Ready events before Created/Expired (Ready submits time-sensitive Stellar signatures).
- Chain-reset safety: the store is chain-scoped (burn tx ids restart after a reset), so it is wiped via Reset() when started with RescanBridgeAccount (the same flag that zeroes the Stellar cursor).

Deployment note: the withdraw memo is part of the signed transaction, so all validators must run this version together. During a mixed-version rollout, withdraw submissions whose signature set spans both versions are rejected (tx_bad_auth) and postponed/retried. Withdraws stall but do not crash or double-pay, and self-heal once all validators are upgraded. No runtime upgrade.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This was referenced Jun 3, 2026
Summary
Adds a persistent idempotency store + startup reconciliation so a crash between submitting a Stellar payment and confirming it on TFChain is recovered without double-paying or double-confirming. Per-transaction submit is retained — batching is intentionally excluded/deferred (see notes). Bridge-only, no runtime upgrade.
This is the crash-safety slice carved out of the larger #1079 draft (batching dropped; the 6 bug fixes split into #1094/#1095; #1092's quarantine preserved).
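The recovery idea summarized above can be sketched as a small simulation. It is an illustration under stated assumptions, not the bridge's handler code: the Bridge struct, handleReady, and the map standing in for a Horizon lookup are all hypothetical, and the Stellar submit and TFChain confirm are reduced to counters.

```go
package main

import (
	"errors"
	"fmt"
)

type State int

const (
	Unknown State = iota
	Processing
	Completed
)

// Bridge bundles hypothetical stand-ins for the idempotency store, the
// Horizon lookup, the Stellar submit and the TFChain confirmation.
type Bridge struct {
	store      map[uint64]State // burn tx id -> recorded state
	horizon    map[uint64]bool  // simulated Horizon: payments found by memo
	failSubmit bool             // simulate a rejected Stellar submission
	Submits    int
	Confirms   int
}

func newBridge() *Bridge {
	return &Bridge{store: map[uint64]State{}, horizon: map[uint64]bool{}}
}

func (b *Bridge) submit(id uint64) error {
	if b.failSubmit {
		return errors.New("simulated submit failure")
	}
	b.horizon[id] = true
	b.Submits++
	return nil
}

func (b *Bridge) confirm(id uint64) error {
	b.Confirms++
	b.store[id] = Completed // set only after a successful confirmation
	return nil
}

// handleReady processes one ready withdraw exactly once across restarts.
func (b *Bridge) handleReady(id uint64) error {
	switch b.store[id] {
	case Completed:
		return nil // already paid and confirmed: skip
	case Processing:
		if b.horizon[id] {
			return b.confirm(id) // payment exists: confirm only, never re-pay
		}
		// No payment found: continue to a fresh submit below.
	}
	b.store[id] = Processing // persisted before the submit
	if err := b.submit(id); err != nil {
		return err // the entry stays PROCESSING for a later retry
	}
	return b.confirm(id)
}

func main() {
	b := newBridge()
	// Simulate a crash after submit but before confirm.
	b.store[7] = Processing
	b.horizon[7] = true
	_ = b.handleReady(7)
	fmt.Println(b.Submits, b.Confirms) // 0 1: confirmed without a second payment
}
```

The ordering is the important part: the PROCESSING mark precedes the submit, and COMPLETED follows the confirmation, so every crash point leaves a state from which the next run can decide between "confirm only" and "submit again".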
What it does
- pkg/idempotency.go: bbolt-backed store tracking PROCESSING/COMPLETED per withdraw (by burn tx id) and refund (by tx hash). Refuses to downgrade COMPLETED → PROCESSING. Reset() wipes it.
- handleWithdrawReady / handleRefundReady: check the store first (skip if COMPLETED); if PROCESSING, look up Horizon for an already-submitted payment and only complete the TFChain confirmation if found; otherwise mark PROCESSING → submit → confirm → mark COMPLETED. The #1092 undeliverable-refund quarantine (fix(bridge): quarantine undeliverable refunds instead of crashlooping) is preserved and now also marks the refund COMPLETED.
- Withdraw recovery: withdraw payments are tagged with the burn tx id as a Stellar text memo (traceability + recovery by memo), with the account sequence number as a fallback for pre-memo submissions. Refund recovery uses the existing MemoReturn hash, sequence as fallback.
- reconcilePendingTransactions: runs once at startup to recover entries left PROCESSING by a previous run, using one Horizon page for all lookups.
- Event loop: Ready events are processed before Created/Expired (Ready submits time-sensitive Stellar signatures).
- Chain-reset safety: the store is chain-scoped (burn tx ids restart after a reset, so old COMPLETED entries → wrongly skipped withdraws), so it is wiped via Reset() when started with RescanBridgeAccount (the same flag that zeroes the Stellar cursor).

Safety reasoning / regression verification (adversarially reviewed)
- PROCESSING is persisted before submit. Crash-after-submit → restart finds the existing tx (memo/sequence) and only confirms on TFChain. Crash-before-submit → no tx exists → safe re-submit. Even if a submitted tx scrolled out of the 200-record Horizon window, a re-submit reuses the already-consumed sequence → Stellar rejects with tx_bad_seq → postpone. No second payment possible.
- COMPLETED is set only after a successful on-chain confirm (or an intentional quarantine forfeit). A submit failure leaves the entry PROCESSING.
- Deposit/mint handling (mint.go) reads incoming memos as the mint destination; withdraws are outgoing (account_debited, From==bridge) and the recovery page only queries source_account=bridge, so a withdraw memo can never be read as a deposit. Withdraw (text) vs refund (return) memo types are disjoint.
- Reset() is DeleteBucket + CreateBucket for both buckets; mints are independently guarded on-chain by IsMintedAlready, so they don't need store tracking.
- The burn tx id memo fits the MemoText 28-byte limit: a uint64 is ≤ 20 digits. Safe.

Deployment note: The withdraw memo is part of the signed transaction, so all validators must run this version together. During a mixed-version rollout, a withdraw whose collected signatures span memo + no-memo versions is rejected by Stellar with tx_bad_auth and postponed/retried. Withdraws stall but do not crash or double-pay, and self-heal once every validator is upgraded. Roll out to the whole validator set together (or in a window where transient withdraw delays are acceptable).

Testing
- go build ./..., go vet ./..., gofmt -l: all clean.
- idempotency.go is a good follow-up unit-test target.

Relationship to other PRs
🤖 Generated with Claude Code
Closes #1054, #1053